-
Notifications
You must be signed in to change notification settings - Fork 13
Introduce with_distributed_execution #209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
… from anything that's not a SessionStateBuilder
| if cfg.network_coalesce_tasks.is_none() && cfg.network_shuffle_tasks.is_none() { | ||
| return Ok(plan); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As now DistributedConfig is always in the config, we cannot rely on it not being present as a signal that the query should not be distributed. Now we also need to check that it's not configured at all in order to skip the distribution.
| if plan.as_any().is::<DistributedExec>() { | ||
| return Ok(plan); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really related to this PR, but found this footgun while making the tests pass, so I thought that it might be fine to add it here, it would have helped me.
|
@adriangb any feedback on this? for us (DataDog) this API change makes sense, but let me know what you guys think |
|
I think we may need a bit finer grained control. While we're still in the stage of getting DATAFUSION_DISTRIBUTED_CHANCE=0.25 # run 25% of queries via `datafusion-distributed` by defaultThis sets the default for an -- SET extension.distributed_chance = 1.0
SELECT * FROM t LIMIT 10;The way we do this is currently an optimizer rule wrapper I think we could do something similar with this new system by tracking But overall I think it's a much nicer / simpler API for most if not all users 🥳 |
This PR refactors the public API people use for enriching a DataFusion
SessionStatewith distributed capabilities.Before, all the
with_distributed_*andset_distributed_*methods were applicable to a lot of DataFusion structs:SessionContextSessionConfigSessionStateSessionStateBuilderBut there's an ergonomic issue with that:
Users have no other option but to use
SessionStateBuilderfor adding configuring distributed capabilities, because that's what the only possible place in DataFusion to add aPhysicalOptimizerRuleimplementation (builder.with_physical_optimizer_rule()).This means that no matter how many ergonomic improvements we bring to other structs, people need to configure things through
SessionStateBuilder.This PR allows configuring distributed DataFusion with a new
with_distributed_execution()that will automatically:ChannelResolverimplementationDistributedConfigstruct in theConfigOptionsDistributedPhysicalOptimizerRuleas an optimization ruleAs these are all things that users need to do proactively one way or another, we now just expose it in one method:
This is a preparation PR for a bigger one: